Data integration with STACAS

Introduction

STACAS is a method for scRNA-seq data integration. It is based on the Seurat integration framework, and adds the following innovations:

  • anchors are filtered/down-weighted based on their distance in reciprocal PCA space, calculated from the unscaled, normalized data
  • integration trees are constructed based on the ‘centrality’ of datasets, as measured by the sum of their anchor (distance-weighted) scores
  • cell labels, if known, can be given as input to the algorithm to perform semi-supervised integration

In this demo we will show the application of STACAS to integrate a collection of scRNA-seq datasets of immune cells from multiple donors, human tissues and studies, assembled by Luecken et al. for their excellent benchmark.

The data are available at: figshare/12420968

R environment

Get and load some useful packages

renv::restore()

if (!require("remotes", quietly = TRUE))
    install.packages("remotes")
library(remotes)

if (!requireNamespace("STACAS", quietly = TRUE))
  remotes::install_github("carmonalab/STACAS")
library(Seurat)
library(dplyr)
library(ggplot2)
library(STACAS)

seed = 1234
set.seed(seed)

Load test datasets

Download the dataset of human immune cells assembled by Luecken et al., and convert them to Seurat objects.

download <- F
where <- 'aux'
dir.create(where, showWarnings = FALSE)

rds.path <- sprintf("%s/Immune_ALL_human.rds", where)

if(download){

  options(timeout=500)
  url <- "https://figshare.com/ndownloader/files/25717328"
  h5.path <- sprintf("%s/Immune_ALL_human.rds", where)
  download.file(url = url, destfile = h5.path)
  
  if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

  if (!require("zellkonverter", quietly = TRUE))
    install.packages("zellkonverter") # to convert from h5ad to R object
  
  #Convert to Seurat 
  object.sce <- zellkonverter::readH5AD(h5.path)
  object <- Seurat::as.Seurat(object.sce, counts = "counts", data = "X")
  object <- RenameAssays(object, originalexp="RNA")
  rm (object.sce)
  
  Idents(object) <- "final_annotation"
  saveRDS(object = object,file = rds.path)

}else{
  object <- readRDS(rds.path)
}

Cell types were annotated by the authors on each dataset individually, using a common dictionary of cell types (see https://github.com/theislab/scib-reproducibility). These are stored in the final_annotation metadata column. Study of origin is stored in batch metadaa

table(object$final_annotation, object$batch)[,1:5]
##                                   
##                                     10X Freytag Oetjen_A Oetjen_P Oetjen_U
##   CD4+ T cells                     2937    1238      577      295     1652
##   CD8+ T cells                      350     270       93      639      253
##   CD10+ B cells                       0       0       91       41       75
##   CD14+ Monocytes                  3388     452      350      263      384
##   CD16+ Monocytes                   364      25       61       26       78
##   CD20+ B cells                    1546     427       43       31      417
##   Erythrocytes                        0       0      186     1219       97
##   Erythroid progenitors               0       0      112      249      102
##   HSPCs                              28       0      227      119       99
##   Megakaryocyte progenitors          21      16       72       71       76
##   Monocyte progenitors                0       0      183      109      136
##   Monocyte-derived dendritic cells  182       0       89       60       65
##   NK cells                          756     476       26        0       63
##   NKT cells                        1056     432      389       62      157
##   Plasma cells                       18       0       33       40       38
##   Plasmacytoid dendritic cells       81      11       54       41       38

How does the collection of datasets look without any integration?

Run a standard Seurat pipeline for dimensionality reduction

nfeatures <- 1000
ndim <- 20
object <- FindVariableFeatures(object, nfeatures = nfeatures) %>%
  NormalizeData() %>% ScaleData() %>%
  RunPCA(npcs=ndim) %>% RunUMAP(dims=1:ndim)
p1_pre <- DimPlot(object, group.by = "batch") + theme(aspect.ratio = 1) + ggtitle("Dataset/batch before integration")
p2_pre <- DimPlot(object, group.by = "final_annotation", label=T, label.size = 4) + NoLegend() +
  theme(aspect.ratio = 1) + ggtitle("Cell labels before integration")


p1_pre | p2_pre

Although cells mostly cluster by the cell type (as annotated in individual datasets), there are also visible batch effects (seen as dataset-specific clustering)

Let’s quantify dataset/batch mixing using the LISI metric from the Raychaudhuri Lab, and conservation of biological information by means of Average Silhouette Width (ASW) of cell labels in the corrected PCA space:

integrationMetrics <- list()

if (!require("scIntegrationMetrics", quietly = TRUE))
install_github("carmonalab/scIntegrationMetrics") #calculates LISI and Silhouette
library(scIntegrationMetrics)
lisi_perplexity <- 20

integrationMetrics[["uncorrected"]] <- list()

integrationMetrics[["uncorrected"]][["batch_LISI"]] <- mean(compute_lisi(object@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="batch", perplexity = lisi_perplexity)[[1]]) # perplexity=effective number of each cell's neighbors; use lower than default to make it faster

integrationMetrics[["uncorrected"]][["celltype_ASW"]] <- mean(compute_silhouette(object@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="final_annotation")[[1]])

integrationMetrics[["uncorrected"]]
## $batch_LISI
## [1] 2.914892
## 
## $celltype_ASW
## [1] 0.036829

We will now apply STACAS for correcting batch effects.

Standard integration

STACAS takes a list of Seurat objects as input, so let’s first split the merged object into a list of individual batches/datasets

obj.list <- SplitObject(object, split.by = "batch")

STACAS Step 1: Find integration anchors

stacas_anchors <- FindAnchors.STACAS(obj.list, 
                                     anchor.features = nfeatures,
                                     dims = 1:ndim)
##  1/10 2/10 3/10 4/10 5/10 6/10 7/10 8/10 9/10 10/10
## Finding integration anchors...

STACAS step 2: Dataset integration

object_integrated <- IntegrateData.STACAS(stacas_anchors, 
                                              dims=1:ndim)

Calculate low-dimensional embeddings and visualize integration results in UMAP

object_integrated <- object_integrated %>% ScaleData() %>% RunPCA(npcs=ndim) %>% RunUMAP(dims=1:ndim)
p1_int <- DimPlot(object_integrated, group.by = "batch") + theme(aspect.ratio = 1) + ggtitle("Dataset/batch after integration")
p2_int <- DimPlot(object_integrated, group.by = "final_annotation", label=T, label.size = 2) + NoLegend() +
  theme(aspect.ratio = 1) + ggtitle("Cell labels after integration") 

p1_int | p2_int

Quantify integration metrics:

integrationMetrics[["STACAS"]] <- list()

integrationMetrics[["STACAS"]][["batch_LISI"]] <- mean(compute_lisi(object_integrated@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="batch", perplexity = lisi_perplexity)[[1]])

integrationMetrics[["STACAS"]][["celltype_ASW"]] <- mean(compute_silhouette(object_integrated@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="final_annotation")[[1]])

integrationMetrics[["STACAS"]]
## $batch_LISI
## [1] 3.351884
## 
## $celltype_ASW
## [1] 0.2295277

Compared to the non-corrected data, we now observe: i) increase in dataset/batch LISI (i.e. effective number of different batches in a cell neighborhood), indicating a higher mixing ii) increase in Silhouette/ASW, indicating that cells with the same annotation (ie of the same kind) were brought into proximity

One-liner STACAS

The previous integration could have been run from the initial object with a single command:

object_integrated <- object %>% SplitObject(split.by = "batch") %>%
      Run.STACAS(dims = 1:ndim, anchor.features = nfeatures) %>%
      RunUMAP(dims = 1:ndim) 

DimPlot(object_integrated, group.by = "batch")

STACAS integration guide trees

Like Seurat, STACAS uses a guide tree to determine integration order. Optionally, you can check the integration guide tree automatically generated by STACAS (e.g. are samples from the same sequencing technology or from similar tissues clustering together?)

st1 <- SampleTree.STACAS(
  anchorset = stacas_anchors,
  obj.names = names(obj.list)
  )

You can tune or manually re-calculate this tree and pass it to IntegrateData.STACAS:

STACAS step 2 with pre-calculated integration tree

object_integrated <- IntegrateData.STACAS(stacas_anchors, 
                                              dims=1:ndim, 
                                              sample.tree=st1
                                              )

Semi-supervised integration

When available, cell type annotations can be used to guide the alignment. STACAS will use this information to penalize anchors where cell types are inconsistent.

In this dataset, cells were annotated by the authors of the benchmark. For your own data, you may want to do manual annotation or apply one of several cell annotation tools, such as SingleR, Garnett or scGate. We will be posting some examples of cell type annotation using scGate in a different demo.

** One-liner semi-supervised STACAS ** by indicating in cell.labels the metadata column that contains cell annotations

object_integrated_ss <- obj.list %>% Run.STACAS(dims = 1:ndim, anchor.features = nfeatures, cell.labels = "final_annotation" )

Note that there is no need for ALL cells to be annotated: we recommend to set labels to NA or unknown for cells that cannot be confidently annotated, and they won’t be penalized for label inconsistency. In addition, you can decide how much weight to give to cell labels with the label.confidence parameter (from 0 to 1).

Visualize on UMAP space

object_integrated_ss <- object_integrated_ss %>% RunUMAP(dims=1:ndim)
p1_ss <- DimPlot(object_integrated_ss, group.by = "batch") + theme(aspect.ratio = 1) + ggtitle("Dataset/batch after semi-supervised integration")
p2_ss <- DimPlot(object_integrated_ss, group.by = "final_annotation", label=T, label.size = 2) + 
  NoLegend() + theme(aspect.ratio = 1) + ggtitle("Cell labels after semi-supervised integration")

p1_ss | p2_ss

Quantify integration metrics:

integrationMetrics[["semisupSTACAS"]] <- list()

integrationMetrics[["semisupSTACAS"]][["batch_LISI"]] <- mean(compute_lisi(object_integrated_ss@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="batch", perplexity = lisi_perplexity)[[1]])

integrationMetrics[["semisupSTACAS"]][["celltype_ASW"]] <- mean(compute_silhouette(object_integrated_ss@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="final_annotation")[[1]])

integrationMetrics[["semisupSTACAS"]]
## $batch_LISI
## [1] 3.199746
## 
## $celltype_ASW
## [1] 0.2474989

In this case, by using the semi-supervised approach we observe a gain in biological signal conservation, as measured by cell labels silhouette (celltype_ASW), with a similar dataset/batch mixing (batch_LISI) compared to the unsupervised integration.

Summary of Integration Metrics

if (!require("tidyr", quietly = TRUE))
install.packages("tidyr")

integrationMetricsSummary <- data.frame(unlist(integrationMetrics)) %>% tibble::rownames_to_column() %>% rename(value=unlist.integrationMetrics.) %>% separate(rowname, c("Method","Metric"), sep="\\.")

a <- integrationMetricsSummary %>% filter(Metric=="batch_LISI") %>%
  ggplot(aes(x=Method, y=value, fill=Method)) + geom_bar(stat="identity") + 
  ggtitle("batch_LISI") + theme_bw() +
  theme(legend.position="none", axis.text.x=element_blank())
  

b <- integrationMetricsSummary %>% filter(Metric=="celltype_ASW") %>%
  ggplot(aes(x=Method, y=value, fill=Method)) + geom_bar(stat="identity") + 
  ggtitle("celltype_ASW") + theme_bw() +
  theme(axis.text.x=element_blank()) 
  

a | b

Label transfer between datasets

In some integration tasks, cell type annotations may be available only for a fraction of the datasets. In such cases, labels can be “transferred” to unannotated datasets by similarity to annotated cells. The annotate.by.neighbors function allows propagating cell labels to unannotated cells by K-nearest neighbor similarity to annotated cells.

For example, if one of the datasets were lacking cell type labels (set to NA):

incomplete <- object
incomplete$final_annotation[incomplete$batch == "10X"] <- NA

Transfer labels from annotated to non-annotated cells

complete <- annotate.by.neighbors(incomplete, labels.col = "final_annotation")

a <- DimPlot(incomplete, group.by = "final_annotation", label=T, label.size = 4) + NoLegend() +
  theme(aspect.ratio = 1) + ggtitle("Incomplete annotation")

b <- DimPlot(complete, group.by = "final_annotation", label=T, label.size = 4) + NoLegend() +
  theme(aspect.ratio = 1) + ggtitle("Complete annotation")

a | b

Notes on data integration

  1. The first critical step in batch effect correction is to ensure that the gene names in your datasets are harmonized. If different studies use different naming conventions, synonyms, etc., methods for dimensionality reduction will treat these genes as entirely different variables, and introducing artificial differences between datasets. It is recommended that your pre-processing pipeline takes care of resolving ambiguities in gene naming across datasets. You can use STACAS:::standardizeGeneSymbols but there are other tools such as HGNChelper

  2. The calculation of highly variable genes (HVG) is a fundamental step for dimensionality reduction and integration. We recommend excluding certain classes of genes, e.g. mitochondrial, ribosomal, heat-shock genes, from the list of HVG, as their expression are typically more strongly associated to technical variation than to actual biological differences. The FindVariableGenes.STACAS() function allows providing a list of genes to exclude from HVG; see an example below.

We can use the collection of signatures stored in SignatuR package

if (!require("SignatuR", quietly = TRUE))
    install_github("carmonalab/SignatuR")

library(SignatuR)
print(SignatuR$Hs)
##                     levelName
## 1  Hs                        
## 2   ¦--Blocklists            
## 3   ¦   ¦--Pseudogenes       
## 4   ¦   ¦--HSP               
## 5   ¦   °--Non-coding        
## 6   ¦--Cell_types            
## 7   ¦--Programs              
## 8   ¦   ¦--cellCycle.G1S     
## 9   ¦   ¦--cellCycle.G2M     
## 10  ¦   ¦--IFN               
## 11  ¦   ¦--Tcell.cytotoxicity
## 12  ¦   ¦--Tcell.exhaustion  
## 13  ¦   °--Tcell.stemness    
## 14  °--Compartments          
## 15      ¦--Mito              
## 16      ¦--Ribo              
## 17      ¦--TCR               
## 18      °--Immunoglobulins
#Retrieve full list of signautures for human
hs.sign <- GetSignature(SignatuR$Hs)

Then we tell STACAS to exclude specific genes from HVG used in integration, using the parameter genesBlockList


my.genes.blocklist <- c(GetSignature(SignatuR$Hs$Blocklists),
                        GetSignature(SignatuR$Hs$Compartments))

object_integrated_blockList <- Run.STACAS(obj.list, genesBlockList = my.genes.blocklist,
                                          dims = 1:ndim, anchor.features = nfeatures)
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Zurich
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] SignatuR_0.1.1           tidyr_1.3.0              scIntegrationMetrics_0.5
## [4] STACAS_2.1.1             ggplot2_3.4.2            dplyr_1.1.2             
## [7] SeuratObject_4.1.3       Seurat_4.3.0.1           remotes_2.4.2           
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3     jsonlite_1.8.5         magrittr_2.0.3        
##   [4] spatstat.utils_3.0-3   farver_2.1.1           rmarkdown_2.22        
##   [7] vctrs_0.6.3            ROCR_1.0-11            spatstat.explore_3.2-1
##  [10] htmltools_0.5.5        BiocNeighbors_1.18.0   sass_0.4.6            
##  [13] sctransform_0.3.5      parallelly_1.36.0      KernSmooth_2.23-21    
##  [16] bslib_0.5.0            htmlwidgets_1.6.2      ica_1.0-3             
##  [19] plyr_1.8.8             plotly_4.10.2          zoo_1.8-12            
##  [22] cachem_1.0.8           igraph_1.5.0           mime_0.12             
##  [25] lifecycle_1.0.3        pkgconfig_2.0.3        Matrix_1.5-4.1        
##  [28] R6_2.5.1               fastmap_1.1.1          fitdistrplus_1.1-11   
##  [31] future_1.32.0          shiny_1.7.4            digest_0.6.32         
##  [34] colorspace_2.1-0       S4Vectors_0.38.1       patchwork_1.1.2       
##  [37] tensor_1.5             irlba_2.3.5.1          vegan_2.6-4           
##  [40] labeling_0.4.2         progressr_0.13.0       fansi_1.0.4           
##  [43] spatstat.sparse_3.0-2  mgcv_1.8-42            httr_1.4.6            
##  [46] polyclip_1.10-4        abind_1.4-5            compiler_4.3.0        
##  [49] withr_2.5.0            BiocParallel_1.34.2    highr_0.10            
##  [52] R.utils_2.12.2         MASS_7.3-60            permute_0.9-7         
##  [55] tools_4.3.0            lmtest_0.9-40          httpuv_1.6.11         
##  [58] future.apply_1.11.0    goftest_1.2-3          R.oo_1.25.0           
##  [61] glue_1.6.2             nlme_3.1-162           promises_1.2.0.1      
##  [64] grid_4.3.0             Rtsne_0.16             cluster_2.1.4         
##  [67] reshape2_1.4.4         generics_0.1.3         gtable_0.3.3          
##  [70] spatstat.data_3.0-1    R.methodsS3_1.8.2      rmdformats_1.0.4      
##  [73] data.table_1.14.8      sp_2.0-0               utf8_1.2.3            
##  [76] BiocGenerics_0.46.0    spatstat.geom_3.2-1    RcppAnnoy_0.0.20      
##  [79] ggrepel_0.9.3          RANN_2.6.1             pillar_1.9.0          
##  [82] stringr_1.5.0          later_1.3.1            splines_4.3.0         
##  [85] lattice_0.21-8         renv_0.17.3            survival_3.5-5        
##  [88] deldir_1.0-9           tidyselect_1.2.0       miniUI_0.1.1.1        
##  [91] pbapply_1.7-2          knitr_1.43             gridExtra_2.3         
##  [94] bookdown_0.34          scattermore_1.2        stats4_4.3.0          
##  [97] xfun_0.39              matrixStats_1.0.0      stringi_1.7.12        
## [100] lazyeval_0.2.2         yaml_2.3.7             evaluate_0.21         
## [103] codetools_0.2-19       data.tree_1.0.0        tibble_3.2.1          
## [106] BiocManager_1.30.21    cli_3.6.1              uwot_0.1.15           
## [109] xtable_1.8-4           reticulate_1.30        munsell_0.5.0         
## [112] jquerylib_0.1.4        Rcpp_1.0.10            globals_0.16.2        
## [115] spatstat.random_3.1-5  png_0.1-8              parallel_4.3.0        
## [118] ellipsis_0.3.2         listenv_0.9.0          viridisLite_0.4.2     
## [121] scales_1.2.1           ggridges_0.5.4         leiden_0.4.3          
## [124] purrr_1.0.1            rlang_1.1.1            cowplot_1.1.1

Further reading

The STACAS package and installation instructions are available at: STACAS package

The code for this demo can be found on GitHub

References

  • Andreatta A., Carmona S. J. (2021). STACAS: Sub-Type Anchor Correction for Alignment in Seurat to integrate single-cell RNA-seq data. - Bioinformatics

  • Luecken, M. D., Büttner, M., Chaichoompu, K., Danese, A., Interlandi, M., Müller, M. F., … & Theis, F. J. (2022). Benchmarking atlas-level data integration in single-cell genomics. - Nature methods

  • Hao, Y., Hao, S., Andersen-Nissen, E., Mauck III, W. M., Zheng, S., Butler, A., … & Satija, R. (2021). Integrated analysis of multimodal single-cell data. - Cell